About the project

In this project I will be practising working with R. It is part of the Open Data Science course. Check out my GitHub repository.


RStudio Exercise 2 - Regression and model validation

In this exercise I have been working on a data set about students' learning motivation. In preparation I transformed the raw data into a data set suitable for analysis. That process is called data wrangling. After that I analysed the data statistically in order to investigate the relationships between certain variables.
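The wrangling script itself is not shown in this report. As a minimal sketch of the kind of transformation it involves (the item columns D03 and D06 and the toy values are illustrative assumptions, not the actual course script): groups of Likert items are averaged into one score, and students with zero exam points are dropped.

```r
# Illustrative wrangling sketch (hypothetical item columns D03, D06):
# average a group of Likert items into one score and drop students
# with zero exam points.
d <- data.frame(
  gender = c("F", "M", "F"),
  D03    = c(4, 3, 5),
  D06    = c(3, 2, 4),
  Points = c(25, 0, 24)
)
d$deep <- rowMeans(d[, c("D03", "D06")])   # combined deep-learning score
d <- subset(d, Points > 0)                 # exclude non-attending students
d
```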

2.1 Reading the students2014 data into R and exploring its dimensions and structure

learning2014 <- read.csv("/Users/eva-mariaroth/Documents/GitHub/IODS-project/learning2014.csv", header = TRUE)
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender Age attitude     deep  stra     surf Points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

The data set consists of 166 observations of 7 variables (columns) related to the learning behaviour of students. The variables are:
gender : factor with two levels, M and F, for male and female
Age : age of the students, integer
attitude : global attitude towards statistics, numeric, Likert scale 1-5
deep : multiple questions combined that explore deep learning, numeric, Likert scale 1-5
stra : multiple questions combined that explore strategic learning, numeric, Likert scale 1-5
surf : multiple questions combined that explore surface learning, numeric, Likert scale 1-5
Points : exam points of the students, integer

2.2 Graphical overview of the data

Installing and accessing the packages ggplot2, which is used to create graphics, and GGally, an extension to ggplot2.

install.packages("ggplot2")
install.packages("GGally")
library("ggplot2")

library("GGally")

Exploring the data by printing out a summary:

summary(learning2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Now we can get an overview of our data. There were 110 women and 56 men questioned for the data set, with an average age of about 25.5 years, the youngest being 17 and the oldest 55 years old. The median age is 22; most of the students are between 21 and 27 years old. The average attitude towards statistics is 3.14. The format of the usual Likert scale is:
1 Strongly disagree
2 Disagree
3 Neither agree nor disagree
4 Agree
5 Strongly agree
The mean of the deep learning questions is about 3.68, that of the strategic learning questions about 3.12, and that of the surface learning questions about 2.79. The average number of points is 22.72, with the minimum being 7 points and the maximum 33.

Exploring the data regarding the relationships between them:

pairs(learning2014[-1], col = learning2014$gender)

The red circles represent the male and the black circles the female participants. In some of these scatter plots a correlation can already be suspected: for example, the number of points seems to increase as the attitude towards statistics increases. A more detailed overview of the variables, their distributions and relationships can be obtained using ggplot2 and GGally.

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

In this plot matrix I can, for example, see how strongly two variables are correlated. There is a correlation between attitude and points with a correlation coefficient of 0.451. There is little or no correlation between the points and deep learning; I can read this from the small correlation coefficient of 0.01.

2.3 Multiple regression model

If we want to examine the strength of the relationship of several explanatory variables with one target (dependent) variable, we can do this with a regression model. The model is written with the formula y ~ x1 + x2 + x3.
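Before fitting the course data, a quick synthetic check (not part of the exercise data) that lm() recovers a known intercept and slope:

```r
# Synthetic check: generate y = 2 + 3*x + small noise,
# then see whether lm() recovers alpha = 2 and beta = 3.
set.seed(1)
x <- 1:50
y <- 2 + 3 * x + rnorm(50, sd = 0.1)
fit <- lm(y ~ x)
round(coef(fit), 1)   # intercept close to 2, slope close to 3
```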

Alpha corresponds to the point where the regression line intercepts the y-axis; beta corresponds to the slope of the regression line. The model can also be used to make predictions. It is assumed that the relationship between y and the explanatory variables is linear. Our target variable is “Points”. As explanatory variables I chose the ones with the highest correlation coefficients in relation to “Points”: attitude, strategic learning and surface learning. Hence our model looks like this:

my_model <- lm(Points ~ attitude + stra + surf, data=learning2014)

The summary can be printed with

summary(my_model)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

It shows the formula that created the model and, underneath, a five-point summary of the residuals of the model. In the table the first column, “Estimate”, estimates the effect (β parameter) of each explanatory variable on the dependent variable. For attitude it is estimated at 3.3952, for strategic learning at 0.8531 and for surface learning at -0.5861. The next column gives the standard error, followed by the t-value and the p-value of the corresponding statistical test. A very low p-value indicates a statistical relationship between the dependent and the explanatory variable. In our case no statistically significant relationship of “Points” with strategic learning or surface learning exists. Hence we refit the model including only attitude.

my_model2 <- lm(Points ~ attitude, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The regression line of “Points” on “attitude” intercepts the y-axis at 11.64 and has a slope of 3.53.
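Since a regression line passes through the point of means, plugging the mean attitude from the summary above into the fitted equation should roughly reproduce the mean of Points (about 22.72). A small consistency check using the printed coefficients:

```r
# Consistency check: predicted Points at the mean attitude should be
# close to the mean of Points (~22.72 from the summary above).
alpha <- 11.6372          # intercept from summary(my_model2)
beta  <- 3.5255           # slope from summary(my_model2)
mean_attitude <- 3.143    # mean of attitude from summary(learning2014)
predicted <- alpha + beta * mean_attitude
predicted                 # close to 22.72
```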

2.4 Diagnostic plots

Residuals vs Fitted values

plot(my_model2, which = 1)

In the linear regression model it is assumed that the errors are normally distributed and that the size of an error does not depend on the explanatory variables. This assumption can be explored with a scatter plot of residuals versus fitted values. If the points are spread randomly, without a pattern, the assumption is supported. Here this is the case.

Normal QQ-plot

plot(my_model2, which = 2)

The QQ-plot explores the assumption that the errors of the model are normally distributed. The points fall well on the line; therefore there is a reasonable fit to the normality assumption.

Residuals vs Leverage

plot(my_model2, which = 5)

Leverage measures how much impact a single observation has on the model. Plotting residuals vs. leverage helps to identify whether single observations have an unusually high impact on the model. In our example no single observation stands out.
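The three diagnostics above can also be drawn side by side in one call. Since the course data is not bundled here, this sketch uses a model on the built-in mtcars data as a stand-in for my_model2:

```r
# Same three diagnostics for any lm fit, drawn in one go.
# mtcars is used here as a self-contained stand-in for the course data.
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(1, 3))             # three panels side by side
plot(fit, which = c(1, 2, 5))    # Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage
par(mfrow = c(1, 1))
```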


RStudio Exercise 3 - Logistic Regression

In this exercise I am analysing a data set about the drinking behaviour of Portuguese students. First, two data sets were combined in order to analyse them (data wrangling). Then certain variables and their relationships with the students' drinking behaviour are analysed. The drinking behaviour was converted into a discrete variable; for a discrete target variable, logistic regression is a suitable method of analysis.

3.1 Reading and describing the data

alc <-read.csv("/Users/eva-mariaroth/Documents/GitHub/IODS-project/alcjoint", header=TRUE, sep=",")

The data set consists of 382 observations of 35 variables. See below the printed names of the variables. For this data set, two data sets were combined (source: UCI Machine Learning Repository). The data was collected in two Portuguese secondary schools and comprises various information about the students (school, sex, age, address), their family background, their attitude towards studying and education, how they spend their free time, their alcohol consumption and their grades (G1, G2, G3). Two questions, concerning workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc), were combined during the data wrangling into the overall alcohol consumption (alc_use). Another variable flagging high alcohol consumption (high_use) was created. It marks all observations with a consumption higher than 2, on a scale where 1 is very low and 5 very high consumption.
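A minimal sketch of how the two combined variables were derived, following the description above (three toy rows; the join of the two source data sets is omitted):

```r
# Sketch of the derived variables (Dalc and Walc as in the data):
d <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2   # average of workday and weekend use
d$high_use <- d$alc_use > 2           # TRUE above the threshold of 2
d
```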

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

3.2 Hypotheses

Quality of family relationships (“famrel”, numeric: from 1 very bad to 5 excellent)
Hypothesis: Students with problematic family relationships have a higher alcohol consumption.
Number of school absences (“absences”, numeric: from 0 to 93)
Hypothesis: Students with a high alcohol consumption skip school more often.
Number of past class failures (“failures”, numeric: n if 1<=n<3, else 4)
Hypothesis: High alcohol consumption leads to a higher number of failed classes.
Final grade (“G3”, numeric: from 0 to 20, output target)
Hypothesis: Students with higher alcohol consumption have lower final grades.

3.3 Visualization of the selected variables

Visualization with barplots

selected_variables <- subset (alc, select = c("famrel", "absences", "failures", "G3", "alc_use"))
library(tidyr)
library(ggplot2)
gather(selected_variables) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Now we want to see how the four selected variables relate to alcohol consumption. As the grades are approximately normally distributed, we can show the relationship between grades and high alcohol consumption nicely with a boxplot.

Visualization with boxplot

g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("Grades") + ggtitle("Effects of high alcohol consumption on grades")

We can see that the male students with an alcohol consumption of more than two have lower grades than the ones who drink less. For the female students there seems to be no difference.

Cross-tabulation

The relationship between high alcohol consumption and the selected variables can also be explored numerically with cross-tabulations. Below we examine the association between a variable and high alcohol consumption by comparing the means of the answers of students with a high alcohol consumption (TRUE) to those of students without (FALSE).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 2 x 3
##   high_use count mean_grade
##      <lgl> <int>      <dbl>
## 1    FALSE   268   11.73507
## 2     TRUE   114   10.80702

tab. 1: Effect of high alcohol consumption on the final grades. We can see from the means that students with a high alcohol consumption have lower average final grades, with a difference of 0.9 points compared to the ones with low alcohol consumption. This supports our hypothesis.

alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (famrel))
## # A tibble: 2 x 3
##   high_use count mean_grade
##      <lgl> <int>      <dbl>
## 1    FALSE   268   4.003731
## 2     TRUE   114   3.780702

tab. 2: Relationship between the quality of family relationships and high alcohol consumption. The quality of family relationships ranges from 1 (very bad) to 5 (excellent). As we can see from the means, the quality of family relationships of students with a high alcohol consumption tends to be slightly worse (3.8) compared to that of students who drink less (4.0). This supports our hypothesis.

alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (absences))
## # A tibble: 2 x 3
##   high_use count mean_grade
##      <lgl> <int>      <dbl>
## 1    FALSE   268   3.705224
## 2     TRUE   114   6.368421

tab. 3: Relationship between drinking behaviour and number of absences. Students with a high alcohol consumption have been absent about 1.7 times as often as students without. That supports our hypothesis.

alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (failures))
## # A tibble: 2 x 3
##   high_use count mean_grade
##      <lgl> <int>      <dbl>
## 1    FALSE   268  0.1417910
## 2     TRUE   114  0.3421053

tab. 4: Relationship between drinking behaviour and number of failed courses. The average number of failed classes is higher when students have a high alcohol consumption. That supports our initial hypothesis.

3.4 Logistic regression

After visualising the data we know that the hypotheses for our four selected variables are supported by the data. But can we also show that statistically? With logistic regression we will now explore the relationship between the variables and the drinking behaviour statistically.

m<-glm(high_use ~ famrel + absences + failures + G3, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + absences + failures + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2195  -0.7992  -0.6770   1.1782   1.9915  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02926    0.67814   0.043 0.965585    
## famrel      -0.22402    0.12525  -1.789 0.073669 .  
## absences     0.08174    0.02253   3.628 0.000285 ***
## failures     0.39080    0.19830   1.971 0.048752 *  
## G3          -0.04377    0.03794  -1.154 0.248623    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 435.52  on 377  degrees of freedom
## AIC: 445.52
## 
## Number of Fisher Scoring iterations: 4

Amongst the four selected variables, the number of absences is most strongly associated with high alcohol consumption. A p-value < 0.05 indicates statistical significance; hence the number of failed courses also has a statistically significant association with high alcohol consumption. The quality of family relationships is inversely related to alcohol consumption, which means the better the relationships, the lower the alcohol consumption; however, this association is not statistically significant. Furthermore, there is no significant association between the final grades and the drinking behaviour according to the model. Only two of our hypotheses could be supported statistically.

Odds ratios and confidence intervals

The odds ratio quantifies the relationship between X and Y. In this case Y is high alcohol use and the Xs are the explanatory variables used in our model: quality of family relationships, absences, failures and final grades.

OR <- coef(m) %>% exp()
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
coef(m)
## (Intercept)      famrel    absences    failures          G3 
##  0.02925904 -0.22402430  0.08174358  0.39080167 -0.04376795

Odds ratios higher than 1 mean that X is positively related with success (here: high alcohol use). Unfortunately my attempt to print the odds ratios together with their confidence intervals did not work, so I cannot interpret them here.
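For completeness, here is how the OR and CI objects computed above could be combined into one table. Since the course data is not at hand, this sketch fits a logistic model on the built-in mtcars data (am ~ wt) as a stand-in for m:

```r
library(MASS)                  # profile-likelihood confint method for glm
# Stand-in logistic model on built-in data (not the alc data):
m2  <- glm(am ~ wt, data = mtcars, family = "binomial")
OR2 <- exp(coef(m2))           # odds ratios
CI2 <- exp(confint(m2))        # 2.5% and 97.5% bounds
cbind(odds_ratio = OR2, CI2)   # each OR next to its confidence interval
```

Heavier cars are less likely to have a manual transmission, so the odds ratio for wt comes out below 1.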

3.5 Predictions

Now we will select only the two variables with a significant relationship to alcohol consumption and fit the model again. Using the logistic model, a prediction of the probability of high alcohol consumption can be made. If the probability is higher than 50%, high consumption is predicted (= TRUE). The predictions are added to the data set as a new variable, as can be seen in the data below, which shows the first ten observations.

m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
head(alc, n=10)
##    school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1      GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2      GP   F  17       U     GT3       T    1    1  at_home    other
## 3      GP   F  15       U     LE3       T    1    1  at_home    other
## 4      GP   F  15       U     GT3       T    4    2   health services
## 5      GP   F  16       U     GT3       T    3    3    other    other
## 6      GP   M  16       U     LE3       T    4    3 services    other
## 7      GP   M  16       U     LE3       T    2    2    other    other
## 8      GP   F  17       U     GT3       A    4    4    other  teacher
## 9      GP   M  15       U     LE3       A    3    2 services    other
## 10     GP   M  15       U     GT3       T    3    4    other    other
##        reason nursery internet guardian traveltime studytime failures
## 1      course     yes       no   mother          2         2        0
## 2      course      no      yes   father          1         2        0
## 3       other     yes      yes   mother          1         2        2
## 4        home     yes      yes   mother          1         3        0
## 5        home     yes       no   father          1         2        0
## 6  reputation     yes      yes   mother          1         2        0
## 7        home     yes      yes   mother          1         2        0
## 8        home     yes       no   mother          2         2        0
## 9        home     yes      yes   mother          1         2        0
## 10       home     yes      yes   mother          1         2        0
##    schoolsup famsup paid activities higher romantic famrel freetime goout
## 1        yes     no   no         no    yes       no      4        3     4
## 2         no    yes   no         no    yes       no      5        3     3
## 3        yes     no  yes         no    yes       no      4        3     2
## 4         no    yes  yes        yes    yes      yes      3        2     2
## 5         no    yes  yes         no    yes       no      4        3     2
## 6         no    yes  yes        yes    yes       no      5        4     2
## 7         no     no   no         no    yes       no      4        4     4
## 8        yes    yes   no         no    yes       no      4        1     4
## 9         no    yes  yes         no    yes       no      4        2     2
## 10        no    yes  yes        yes    yes       no      5        5     1
##    Dalc Walc health absences G1 G2 G3 alc_use high_use probability
## 1     1    1      3        5  2  8  8     1.0    FALSE   0.2792474
## 2     1    1      3        3  7  8  8     1.0    FALSE   0.2459680
## 3     2    3      3        8 10 10 11     2.5     TRUE   0.5767118
## 4     1    1      5        1 14 14 14     1.0    FALSE   0.2154691
## 5     1    2      5        2  8 12 12     1.5    FALSE   0.2303651
## 6     1    2      5        8 14 14 14     1.5    FALSE   0.3340009
## 7     1    1      3        0 12 12 12     1.0    FALSE   0.2012844
## 8     1    1      1        4  8  9 10     1.0    FALSE   0.2622677
## 9     1    1      1        0 16 17 18     1.0    FALSE   0.2012844
## 10    1    1      5        0 13 14 14     1.0    FALSE   0.2012844
##    prediction
## 1       FALSE
## 2       FALSE
## 3        TRUE
## 4       FALSE
## 5       FALSE
## 6       FALSE
## 7       FALSE
## 8       FALSE
## 9       FALSE
## 10      FALSE

Now we will compare the prediction to our target variable via cross-tabulation.

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     99   15

The cross-tabulation shows that there are 10 cases that are FALSE in the original data but predicted as TRUE by the model. Furthermore there are 99 cases where the students' answer was TRUE and the model predicted FALSE. In 273 cases the model predicted the correct answer. Hence the total proportion of inaccurately classified individuals (= the training error) is 28.53%: the model only predicted 71.47% of the answers correctly. I would say that this is a relatively low accuracy and the model would need to be adjusted, for example by including more variables that have a strong relation to the drinking behaviour. Finally we can also see the data visualised, though I find the cross-tabulation more comprehensible.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()


RStudio Exercise 4 - Clustering and classification

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

4.1 Exploring the data set “Boston”

There are many data sets already available in R or included with package installations. In this exercise we will practise our analysis on a data set called “Boston” from the “MASS” package. The data set concerns housing values in the suburbs of Boston. It has 14 variables with 506 observations, covering living standards such as accessibility and number of rooms.
The descriptions of the variables are:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

All the variables are numeric except chas and rad, which are integer.

Now, let's have a look at the summary of the distributions of the variables. Here we can see, for example, that the mean crime rate is 3.61, whereas the median is 0.26 and the values range between 0.01 and 88.98. The mean of the pupil/teacher ratio is 18.46, which I find surprisingly low.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

With the following bar plots we can get a graphical overview of the distribution of the data within the variables.

library(ggplot2)
library(tidyr)
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Using the pairs function we can see how the variables are distributed and derive first assumptions about their relationships with each other. However, there are too many variables to get an overview at a glance.

pairs(Boston)

library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

A better option to investigate the relationships between the variables is the correlation plot. The colour indicates the kind of correlation:
red - negative correlation
blue - positive correlation
The size of the circle and the intensity of the colour indicate the magnitude of the correlation coefficient:
strong colours - correlation coefficient is high
light colours - correlation coefficient is low
The strongest correlation can be found between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax). The weighted mean of distances to five Boston employment centres (dis) also correlates strongly, and negatively, with age, nitrogen oxides concentration and the proportion of non-retail business acres per town.
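The strongest pairwise correlation can also be located programmatically instead of by eye; a small self-contained sketch reusing the Boston data:

```r
# Find the strongest off-diagonal correlation in the Boston data.
library(MASS)
cm <- cor(Boston)
diag(cm) <- 0                                   # ignore trivial self-correlations
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)
rownames(cm)[idx[1, ]]                          # the rad / tax pair (r = 0.91)
```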

4.2 Data wrangling

Standardizing the data set and creating a factor variable for the crime rate. As the data contains only numerical values, we can use the scale() function to standardize the whole data set. Furthermore, we divide the data set into a train and a test set with 80% and 20% of the observations, respectively, selected at random.

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The variable crim describes the per capita crime rate by town.

class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
##           zn      indus       chas        nox        rm        age
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909
##         medv crime
## 1  0.1595278   low
## 2 -0.1014239   low
## 3  1.3229375   low
## 4  1.1815886   low
## 5  1.4860323   low
## 6  0.6705582   low
n <- nrow(boston_scaled)
ind <- sample(n, size = n*0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
dim(test)
## [1] 102  14
dim(train)
## [1] 404  14
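The quantile binning used above can be illustrated on a toy vector (a sketch with made-up data): quantile() returns the cut points and cut() turns the continuous values into an ordered factor with roughly equally sized classes.

```r
x <- 1:20
bins <- quantile(x)   # 0%, 25%, 50%, 75% and 100% cut points

cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)   # each class receives a quarter of the observations
```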

Fitting a linear discriminant analysis to the training set. Crime is the target variable and all the other variables are predictors.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2524752 0.2450495 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox          rm
## low       0.86483259 -0.9023069 -0.120902136 -0.8570283  0.43956568
## med_low  -0.08074812 -0.3311175 -0.002135914 -0.5804799 -0.12514426
## med_high -0.38156320  0.2460322  0.204895203  0.4187792  0.08762615
## high     -0.48724019  1.0171737 -0.033716932  1.0406471 -0.35647105
##                 age        dis        rad        tax     ptratio
## low      -0.8211460  0.8453500 -0.6936501 -0.7705826 -0.34676434
## med_low  -0.3563129  0.3790438 -0.5483812 -0.5020242 -0.05916205
## med_high  0.4465741 -0.3951306 -0.4366394 -0.3040111 -0.30699364
## high      0.7994240 -0.8443661  1.6375616  1.5136504  0.78011702
##               black       lstat         medv
## low       0.3798850 -0.76313608  0.505895177
## med_low   0.3173297 -0.13465682  0.003468435
## med_high  0.0912241  0.02573923  0.161944008
## high     -0.6952551  0.83945692 -0.688784912
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.06089238  0.59913318 -1.0196710
## indus    0.09612264 -0.31872537  0.2542926
## chas    -0.07537586 -0.07894164  0.1521931
## nox      0.34311659 -0.77432542 -1.3343262
## rm      -0.09902015 -0.10571859 -0.2413598
## age      0.17213446 -0.29763305 -0.2115453
## dis     -0.05464855 -0.26691832  0.1308873
## rad      3.34787750  1.07413986 -0.1318456
## tax      0.04436256 -0.12169707  0.6373323
## ptratio  0.09966469  0.06363956 -0.3658725
## black   -0.13526602  0.09404649  0.1351621
## lstat    0.23803183 -0.29201064  0.4068448
## medv     0.20922800 -0.41113180 -0.1545870
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9515 0.0363 0.0123
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Predicting the classes of the test data with the fitted LDA model

correct_classes<-test$crime
test<-dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       7        1    0
##   med_low    3      17        4    0
##   med_high   0      10       15    2
##   high       0       0        0   28
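From the cross tabulation above we can compute the share of correctly classified test observations (the numbers below are copied from the printed table):

```r
# the confusion matrix printed above, entered by hand
conf <- matrix(c(15,  7,  1,  0,
                  3, 17,  4,  0,
                  0, 10, 15,  2,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)   # 0.735, i.e. about 74 % of the test observations
```

The high class is predicted perfectly, while most of the errors occur between the two middle classes.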

K-means Clustering

#reloading
data(Boston)

#scale dataset and make data frame
boston_scaled1<-as.data.frame(scale(Boston))

For calculating the distances between the observations we use the most common measure, the Euclidean distance.

dist_eu <-dist(boston_scaled1)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
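dist() supports other metrics as well; as a sketch, the Manhattan distance on the same scaled data. Whatever the metric, the distance object holds n*(n-1)/2 pairwise values.

```r
library(MASS)
data(Boston)
boston_scaled1 <- as.data.frame(scale(Boston))

dist_man <- dist(boston_scaled1, method = "manhattan")
length(dist_man)   # 506 * 505 / 2 = 127765 pairwise distances
summary(as.vector(dist_man))
```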

Now we apply k-means to the dataset. We start with four cluster centers.

km <- kmeans(boston_scaled1, centers = 4)

However, we want to determine the optimal number of clusters.

set.seed(123)

#set the max number of clusters to be ten
k_max<- 10

#calculate total within-cluster sum of squares (WCSS) for 1 to k_max clusters on the scaled data
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled1, k)$tot.withinss})

#visualizing WCSS
qplot(x = 1:k_max, y = twcss, geom = 'line')
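The elbow can also be inspected numerically: for k = 1 the total WCSS equals the total sum of squares of the standardized data, and the successive drops shrink quickly after the first steps. A sketch, recomputing the curve on the scaled data:

```r
library(MASS)
data(Boston)
boston_scaled1 <- as.data.frame(scale(Boston))

set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled1, k)$tot.withinss)

# for k = 1 the WCSS is the total sum of squares: (n - 1) * p = 505 * 14
twcss[1]
# sizes of the successive drops
round(-diff(twcss))
```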

#the WCSS curve drops sharply up to two clusters, so we re-run k-means with two centers
km <- kmeans(boston_scaled1, centers = 2)

#visualize Kmeans
pairs(boston_scaled1, col = km$cluster)

pairs(boston_scaled1[6:10], col = km$cluster)

Bonus

First we look at the summary of the original Boston dataset and then standardize it again.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
boston_scaled2 <-scale(Boston)
summary(boston_scaled2)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Now we perform the K-means on the standardized dataset.

km2 <- kmeans(boston_scaled2, centers = 3)
boston_scaled2 <- as.data.frame(boston_scaled2)

Finally we perform Linear Discriminant Analysis, using the clusters as target classes, including all the variables from the Boston data in the LDA model.

lda.fit2 <- lda(km2$cluster~., data = boston_scaled2)

Visualizing the results, we can observe…

# function for the lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# setting target classes as numeric
classes <- as.numeric(km2$cluster)

# plotting the lda results
plot(lda.fit2, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale = 1)

Super bonus

We run the code below on the scaled train data that we used to fit the LDA. It creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Now we load the plotly package in order to create a 3D plot from our data.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

We draw the same plot again, but set the crime classes to be the colours.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Drawing the same plot again, with the colours matching the k-means clusters. The train data frame has no cluster column, so (as a sketch) we pick the cluster labels of the training rows from the k-means result, assuming the two-cluster solution km and the training indices ind from above.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = as.factor(km$cluster[ind]))